Generating image captions with the BLIP model

Introducing: Hugging Face, Transformers, and BLIP

Hugging Face is an organization that focuses on natural language processing (NLP) and artificial intelligence (AI). The organization is widely known for its open-source library, Transformers, which provides thousands of pretrained models to the community. The library supports a wide range of NLP tasks, such as translation, summarization, and text generation. Transformers has contributed significantly to recent advancements in NLP by making state-of-the-art models, such as BERT, GPT-2, and GPT-3, accessible to researchers and developers worldwide.

The Transformers library also includes models that can capture information from images. BLIP, short for Bootstrapping Language-Image Pre-training, is a model that helps computers understand and generate language based on images. It's like teaching a computer to look at a picture and describe it, or answer questions about it.

Alright, now that you know what BLIP can do, let's get started with implementing a simple image captioning AI app!

Step 1: Import your required tools from the transformers library

You already installed the transformers package while setting up the environment.

In the project directory, create a Python file: open the File Explorer, right-click in the explorer area, and select New File. Name this new file image_cap.py. Copy the code segments below and paste them into the Python file.

(Image: creating a new file in the File Explorer)

You will be using AutoProcessor and BlipForConditionalGeneration from the transformers library.

AutoProcessor and BlipForConditionalGeneration are the two classes you need to work with BLIP, a vision-language model available in the Hugging Face Transformers library.

  • AutoProcessor : This is a processor class that is used for preprocessing data for the BLIP model. It wraps a BLIP image processor and a tokenizer into a single processor. This means it can handle both image and text data, preparing it for input into the BLIP model.

    Note: A tokenizer is a tool in natural language processing that breaks down text into smaller, manageable units (tokens), such as words or phrases, enabling models to analyze and understand the text.

  • BlipForConditionalGeneration : This is a model class that is used for conditional text generation given an image and an optional text prompt. In other words, it can generate text based on an input image and an optional piece of text. This makes it useful for tasks like image captioning or visual question answering, where the model needs to generate text that describes an image or answer a question about an image.
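To make the idea of tokenization concrete, here is a minimal sketch of a whitespace tokenizer in plain Python. This is a toy illustration only; real tokenizers, like the one wrapped by AutoProcessor, split text into learned subword units and map each one to an integer ID from a vocabulary.

```python
def simple_tokenize(text):
    """Toy tokenizer: lowercase the text and split it on whitespace.
    Real tokenizers also map each token to an integer ID."""
    return text.lower().split()

tokens = simple_tokenize("The image of a dog")
print(tokens)  # ['the', 'image', 'of', 'a', 'dog']
```

The list of tokens is what a model actually consumes (after conversion to IDs), rather than the raw string.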

```python
import requests
from PIL import Image
from transformers import AutoProcessor, BlipForConditionalGeneration

# Load the pretrained processor and model
processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
```

Step 2: Load the image and generate a caption

After loading the processor and the model, you need to load the image to be captioned. The image data must be loaded and preprocessed before it is ready for the model.

To load the image, right-click anywhere in the Explorer (on the left side of the code pane) and click Upload Files... (shown in the image below). You can upload any image from your local files and modify img_path to match the name of your image.

(Image: Upload files option in the Explorer)

In the next phase, you fetch the image that your pretrained model will caption. This image can either be a local file or fetched from a URL. The Python Imaging Library (PIL) is used to open the image file and convert it into RGB format, which is suitable for the model.

```python
# Load your image; DON'T FORGET TO WRITE YOUR IMAGE NAME
img_path = "YOUR IMAGE NAME.jpeg"
# Convert it into RGB format
image = Image.open(img_path).convert('RGB')
```

Next, the image is passed through the processor to generate inputs in the required format. The return_tensors argument is set to "pt" to return PyTorch tensors.

```python
# You do not need a question for image captioning
text = "the image of"
inputs = processor(images=image, text=text, return_tensors="pt")
```

You then pass these inputs into your model's generate method. The argument max_new_tokens=50 specifies that the model should generate a caption of up to 50 tokens in length.

The two asterisks (**) in Python are used in function calls to unpack dictionaries and pass items in the dictionary as keyword arguments to the function. **inputs is unpacking the inputs dictionary and passing its items as arguments to the model.
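As a quick illustration of this syntax, here is a small self-contained example using a toy function (not the BLIP model): ** expands each key in the dictionary into a keyword argument.

```python
def greet(name, punctuation):
    # A simple function with two keyword arguments
    return "Hello, " + name + punctuation

kwargs = {"name": "BLIP", "punctuation": "!"}
# greet(**kwargs) is equivalent to greet(name="BLIP", punctuation="!")
print(greet(**kwargs))  # Hello, BLIP!
```

In the same way, model.generate(**inputs) passes each entry of the inputs dictionary (such as the pixel values and token IDs) to generate as a named argument.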

```python
# Generate a caption for the image, up to 50 new tokens long
outputs = model.generate(**inputs, max_new_tokens=50)
```

Finally, the generated output is a sequence of tokens. To transform these tokens into human-readable text, you use the decode method provided by the processor. The skip_special_tokens argument is set to True to ignore special tokens in the output text.

```python
# Decode the generated tokens to text
caption = processor.decode(outputs[0], skip_special_tokens=True)
# Print the caption
print(caption)
```

Save your Python file and run it to see the result.

```shell
python3 image_cap.py
```

And you have the image's caption, generated by your model! This caption is a textual representation of the content of the image, as interpreted by the BLIP model.

(Image: caption output in the terminal)